A Semantic Approach to Person Profile Extraction from Farsi Documents

Authors

  • Hojjat Emami
  • Hossein Shirazi
  • Ahmad Abdollahzadeh Barforoush
Abstract

Entity profiling (EP), an important task in Web mining and information extraction (IE), is the process of extracting the entities in question and their related information from given text resources. From a computational viewpoint, Farsi is one of the less-studied and less-resourced languages, and it suffers from a lack of high-quality language processing tools. This gap underscores the need to develop Farsi text processing systems. As a contribution to EP research, we present a semantic approach for extracting the profiles of person entities from Farsi Web documents. Our approach has three major components: (i) pre-processing, (ii) semantic analysis, and (iii) attribute extraction. First, the system takes raw text as input and annotates it using existing pre-processing tools. In the semantic analysis stage, we analyze the pre-processed text syntactically and semantically, and enrich the locally processed information with semantic information obtained from a distant knowledge base. We then use a semantic rule-based approach to extract the information related to the persons in question. We show the effectiveness of our approach by testing it on a small Farsi corpus. The experimental results are encouraging and show that the proposed method outperforms baseline methods.
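
The three components described in the abstract can be pictured with a minimal Python sketch. This is only an illustration under stated assumptions, not the authors' system: the function names, the tiny in-memory KNOWLEDGE_BASE, the two RULES, and the English example sentences are all hypothetical stand-ins for the Farsi pre-processing tools, the distant knowledge base, and the semantic rules the abstract mentions.

```python
import re

# Hypothetical stand-in for a distant knowledge base: entity name -> semantic type.
KNOWLEDGE_BASE = {
    "Tehran": "City",
    "University of Tehran": "Organization",
}

# Hypothetical rules: (attribute name, surface pattern, required semantic type).
RULES = [
    ("birth_place", re.compile(r"was born in (?P<value>[A-Z][\w ]*)"), "City"),
    ("employer", re.compile(r"works at (?:the )?(?P<value>[A-Z][\w ]*)"), "Organization"),
]


def preprocess(text):
    """Stage 1 (toy): split raw text into sentences."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def analyze(sentences):
    """Stage 2 (toy): attach knowledge-base types to the entities found in each sentence."""
    annotated = []
    for sentence in sentences:
        types = {name: kind for name, kind in KNOWLEDGE_BASE.items() if name in sentence}
        annotated.append((sentence, types))
    return annotated


def extract_attributes(annotated):
    """Stage 3 (toy): keep a rule match only if the value has the required semantic type."""
    profile = {}
    for sentence, types in annotated:
        for attribute, pattern, expected_type in RULES:
            match = pattern.search(sentence)
            if match is None:
                continue
            value = match.group("value").strip()
            if types.get(value) == expected_type:
                profile[attribute] = value
    return profile


def build_profile(text):
    """Run the three stages in order and return the extracted attributes."""
    return extract_attributes(analyze(preprocess(text)))


if __name__ == "__main__":
    print(build_profile("Sara Ahmadi was born in Tehran. She works at the University of Tehran."))
```

In this sketch the semantic type check in the third stage is what distinguishes a semantic rule from a purely surface pattern: a match such as "was born in Paris" is discarded because the toy knowledge base has no type for "Paris".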

Similar resources

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

Cross–linguistic Comparison of Refusal Speech Act: Evidence from Trilingual EFL Learners in English, Farsi, and Kurdish

To date, little research on pragmatic transfer has considered a multilingual situation where there is an interaction among three different languages spoken by one person. Of interest was whether pragmatic transfer of refusals among three languages spoken by the same person occurs from L1 and L2 to L3, L1 to L2 and then to L3 or from L1 and L1 (if there are more than one L1) to L2. This study ai...

A Composite Kernel Approach for Detecting Interactive Segments in Chinese Topic Documents

Discovering the interactions between persons mentioned in a set of topic documents can help readers construct the background of a topic and facilitate comprehension. In this paper, we propose a rich interactive tree structure to represent syntactic, content, and semantic information in text. We also present a composite kernel classification method that integrates the tree structure with a bigra...

Rotation and Scale Invariant Feature Extraction Using Complex Zernike Moments for Farsi and Arabic Handwriting Character

Analyzing Farsi and Arabic handwritten documents is one area in image processing whose target is to transform picture documents into symbolic form. This transformation is conducted to make saving, improving, retrieving, reusing, searching, and transferring documents rapid and easy. Analyzing documents is performed in five stages: pre-processing, segmentation, representation, recognition and post-p...

Towards a Semantic Information Extraction Approach from Unstructured Documents

Recognizing and extracting meaningful information from semi- and unstructured documents, taking into account their semantics, and storing them in a database is an important problem in the context of information access and retrieval. This paper describes a novel logic-based approach to information extraction from both semi- and unstructured documents. The approach, implemented in the HiLεX system, i...

Publication date: 2017